The ultimate learning machine


As the machine studies the data, 
it teaches us and we are the learners. 


Overly condensed version of lecture 


Given binary or category data, any nonparametric 
statistical learning machine that is consistent for the 
regression problem will generate consistent probability 
estimates. 


Given consistency, the machine will also be Brier score 
consistent. 


JD Malley, J Kruppa, A Dasgupta, KG Malley, A 
Ziegler (2011). Probability Machines: Consistent 
Probability Estimation Using Nonparametric Learning 
Machines. Methods of Information in Medicine 


Another view of learning machines 


Outline 


Background 
Machine Mythology: Learning and Unlearning Machines 
Strategies for Learning Probabilities 


Research Questions 


Background Resources 


A Probabilistic Theory of Pattern Recognition 
Luc Devroye, László Györfi, Gábor Lugosi
(Springer, 1996) - DGL 


A Distribution-Free Theory of Nonparametric Regression 
László Györfi, Michael Kohler, Adam Krzyżak, Harro Walk
(Springer, 2002) - GKKW 


Additional Resources 


Regression Modeling Strategies 
Frank E. Harrell, Jr. 
(Springer, 2001) 


Statistical Learning for Biomedical Data 
James Malley, Karen Malley, Sinisa Pajevic 


(Cambridge University Press, 2011) 


(plus about 100 other perfectly good books) 


The Bayes decision machine 


The minimum Bayes loss rule is defined using the
conditional probability function η(x)


η(x) = Pr[Y = 1 | X = x]

And the Bayes machine decision (rule) g* is


g*(x) = 1 if η(x) > 1/2
      = 0 if η(x) ≤ 1/2
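As a minimal sketch of the rule above, assuming the true probability Pr[Y = 1 | X = x] were known at a few test points (the three values below are made up for illustration, and the function names are hypothetical):

```python
def bayes_rule(eta_x: float) -> int:
    """Bayes decision g*(x): predict 1 when Pr[Y = 1 | X = x] exceeds 1/2."""
    return 1 if eta_x > 0.5 else 0


def bayes_error_at(eta_x: float) -> float:
    """Conditional error of g* at x: the smaller of eta(x) and 1 - eta(x)."""
    return min(eta_x, 1.0 - eta_x)


# Hypothetical values of eta(x) at three test points (illustration only).
etas = [0.9, 0.5, 0.2]
decisions = [bayes_rule(e) for e in etas]
errors = [bayes_error_at(e) for e in etas]

for e, d, err in zip(etas, decisions, errors):
    print(f"eta={e:.2f}  g*={d}  conditional error={err:.2f}")
```

In practice the true probability is never available, which is the point of the next slide.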


The Bayes decision rule is best 


No other decision rule (machine) can in principle have a lower
Bayes error loss than g*(x)


But we never know the true probability function η(x)
So we can't directly define g*(x)


It can be hard to estimate η(x) not knowing any details
about the distribution of the data


However, it's not necessary to accurately estimate η(x)
to get good binary decision rules


Classification is considered much easier than probability 
estimation, but this is in fact not exactly true. 


Details for Classification
Given training data D_n, with sample size n, of training
vectors (x_1, y_1), (x_2, y_2), ..., (x_n, y_n)


For X = x_i, a d-dimensional vector of features (attributes)
    Y = y_i, a binary outcome, Y ∈ {0, 1}


Goal: predict new Y given new X
with machine (rule) g(X): g(X) ∈ {0, 1}
generated using training data D_n: g_n(X) = g(X; D_n)


Evaluate machine using Bayes error (loss) L_n


L_n = L(g_n) = Pr[ g_n(X) ≠ Y, given training data D_n ]


Unlearning about Learning Machines 


Misinformation and misunderstanding about the 
subject is wide-spread. 


Groups of very smart people don't talk to each 
other as much as one would predict. 


Still, we all have an essential, hard-earned right to 
remain ignorant. 


Myth #1 There exists a super machine that will
classify all data really well 


FACT: Given machine A there always exists a universally good 
machine B and a distribution D such that: 


1. For data D the true Bayes error probability is exactly zero 


2. Machine A has error probability > Machine B error probability 
(and for all sample sizes from D) 


Some Cautions 


Given a machine A there is not necessarily a 
machine B that is better for every data set. 


Given a machine A and a machine B there is not
necessarily a data set such that B is better than A 
on that data. 


Both of these related —but not equivalent— 
assertions are open research questions. 


Need basic probability and advanced combinatorial
methods to make progress. See DGL, GKKW


error converges in probability to the Bayes error: Bayes consistent
error converges a.e. to the Bayes error: strongly Bayes consistent


Myth #2 A machine must be complex and really 
quite clever in order to be very good 


FACT: There are many simple, practical machines that 
are very good. 


Key example: The 1-nearest neighbor machine (1-NN)
requires no training, no tuning, no hard-won optimization.
And yet for data with true Bayes error L* the large-sample
Bayes loss L_NN satisfies


L* ≤ L_NN ≤ 2L*


So if the true Bayes error is small then
without any further work,
the Bayes loss for 1-NN is also small.


Technical details about Myth #2 


Note that the apparent (observed) estimated error on the
training data D_n for the 1-NN machine is always exactly zero.


It is not magical or mysterious for a machine to have this
property. The logitboost machine can have this property.


So overfitting the training data does not necessarily mean
that we can't make good predictions on new data.


Also, how we estimate error on the training data D_n is
important.
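A small sketch of why the way error is estimated on the training data matters. It assumes synthetic data with coin-flip labels (no signal at all) and compares the apparent error of 1-NN with a leave-one-out estimate; the function name and the data are hypothetical, chosen only for illustration.

```python
import random


def one_nn_predict(train, x):
    """Return the label of the training point closest to x (squared Euclidean distance)."""
    nearest = min(train, key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], x)))
    return nearest[1]


rng = random.Random(0)
# Synthetic training data: labels are pure coin flips, so there is nothing to learn.
train = [((rng.random(), rng.random()), rng.randint(0, 1)) for _ in range(200)]

# Apparent error: each training point is its own nearest neighbor.
apparent_errors = sum(one_nn_predict(train, x) != y for x, y in train)

# Leave-one-out: each point is predicted from the other n - 1 points.
loo_errors = sum(
    one_nn_predict(train[:i] + train[i + 1:], x) != y
    for i, (x, y) in enumerate(train)
)

apparent_rate = apparent_errors / len(train)
loo_rate = loo_errors / len(train)
print(f"apparent error: {apparent_rate:.2f}, leave-one-out error: {loo_rate:.2f}")
```

The apparent error is zero even though the labels are noise, while the leave-one-out estimate sits near coin-tossing, as it should for this data.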


More about unlearning Myth #2 


Nature is not particularly interested in right or wrong; 
She does not require deep and complicated; 
She only asks that we listen. 


Myth #3.1 A good machine needs very little data 


FACT: For any good machine there is a data set D such
that the machine error is far above its large sample 
Bayes error 


Given any small constant c there is a data set such that at
any sample of size n the Bayes error L_n satisfies


L_n ≥ 1/2 - c


That is, the machine gets stuck arbitrarily close to 
coin-tossing. See DGL, Chapter 6 


Myth #3.2 A good machine needs very little data 


FACT: Given a good machine there exists data such that 
the estimated Bayes error converges arbitrarily slowly 
to true large sample lower limit. 


1. This is true even for universally strongly Bayes consistent 
machines. 


2. The Bayes error can be held above any decreasing 
sequence, at every n. 


3. These facts should be disturbing. 


Myth #4 A weak machine must be abandoned


FACT: Many weak machines (all with high error) can be 
combined to generate a provably very 
good machine (low error). 


1. Basic idea goes back to Condorcet (1785!) 


2. Weak, provably inconsistent machines may be very strong 
when decisions are pooled (Gérard Biau et al., 2008). 


3. Committee decisions are a key part of the Random Forest 
and bagged nearest neighbor machines. 


4. The method of Mojirsheibani (1999) goes further. . . 
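Condorcet's idea can be sketched with a short calculation. The key assumption, which real machines rarely satisfy exactly, is that the voters err independently; the accuracy 0.55 and the committee sizes are arbitrary illustrative values.

```python
from math import comb


def majority_vote_accuracy(p: float, m: int) -> float:
    """Probability that a majority of m independent voters, each correct
    with probability p, reaches the correct decision (m must be odd)."""
    if m % 2 == 0:
        raise ValueError("use an odd number of voters to avoid ties")
    return sum(
        comb(m, k) * p**k * (1 - p) ** (m - k)
        for k in range(m // 2 + 1, m + 1)
    )


single = majority_vote_accuracy(0.55, 1)
committee_101 = majority_vote_accuracy(0.55, 101)
committee_501 = majority_vote_accuracy(0.55, 501)

print(f"1 voter: {single:.3f}, 101 voters: {committee_101:.3f}, 501 voters: {committee_501:.3f}")
```

Under the independence assumption, a committee of barely-better-than-chance voters improves steadily as it grows.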


A superior committee decision method 
for these uncertain times 


Mojirsheibani (1999, 2002) showed that any collection 
of machines could be pooled to get: 


1. A group machine that is at least as good as the 
strongest machine in the set. 

2. A group machine that is Bayes optimal if any 
machine in the group is Bayes optimal. 

3. And we don't need to know which machine is 
which in (1) or (2). 

4. Method is large sample optimal — see all the 
cautions above about small or fixed sample sizes. 

5. Method is a kind of single decision tree. 


More on unlearning Myth #4 


How full of inconsistencies and absurdities it is. I declare 
that taking the average of many minds that have recently 
come before me, I should prefer the obedience, affections 
and instinct of a dog before it. 

Michael Faraday 


Myth #5 Finding optimal parameter estimates
is equivalent to finding good decision
machines


FACT: For the binary classification problem the real issue is
getting good {0, 1} predictions and low Bayes error.


This is not the same as finding the minimum squared error 
for any parameters in the machine code. 


DGL has key examples showing that MSE can be 
minimized and yet the resulting decision machine has 
large error probability, far from minimum. 


Myth #6 A good binary machine works because
it is a good estimator of the group probability


FACT: Simple examples show otherwise.


A good probability machine is certainly a good binary
decision machine.


But an excellent binary rule can be a very poor
probability estimator.


Basic idea: a good decision machine only has to get a
probability estimate that is on the same side of the
decision boundary as the Bayes probability η(x).
It can be very different from the Bayes probability.
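A toy illustration of the basic idea, using made-up numbers: two hypothetical machines give the same decisions as the Bayes rule, yet only one of them tracks the true probability.

```python
# Hypothetical true probabilities at five points, and two machines' estimates.
eta = [0.95, 0.70, 0.55, 0.30, 0.05]
good_prob = [0.93, 0.72, 0.56, 0.28, 0.06]    # close to eta everywhere
crude_prob = [0.51, 0.51, 0.51, 0.49, 0.49]   # only the side of 1/2 is right


def decide(p: float) -> int:
    """Threshold a probability at 1/2 to get a 0/1 decision."""
    return 1 if p > 0.5 else 0


def mean_squared_gap(est, truth) -> float:
    """Average squared distance between estimated and true probabilities."""
    return sum((e - t) ** 2 for e, t in zip(est, truth)) / len(truth)


bayes_decisions = [decide(p) for p in eta]
same_decisions = [decide(p) for p in crude_prob] == bayes_decisions
gap_good = mean_squared_gap(good_prob, eta)
gap_crude = mean_squared_gap(crude_prob, eta)

print(f"same decisions as Bayes rule: {same_decisions}")
print(f"squared gap, good machine: {gap_good:.5f}; crude machine: {gap_crude:.5f}")
```

Both machines classify identically, so error rates cannot tell them apart; only a probability-based score can.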


More on Myth #6...


* Good decision machines like logitboost and Random Forest
  do not natively, directly estimate the group probability.


* It might be possible to re-engineer logitboost (LB) for
  probability estimation but this is not certain.


* But it is very easy to re-engineer Random Forest to get valid
  probability estimates.


* There is an evident trade-off for probability estimation using
  a support vector machine: sparsity vs. accuracy.
  Bartlett and Tewari, 2007.


more on all this in a few moments. . . 


Myth #7 A good machine requires careful tuning 
to work well 


FACT: Good machines need only some tuning, not much. 
Most of the necessary tuning rules just track the 
sample size. 


Neural nets: the number of nodes, k, for the first hidden layer
and with the sigmoid threshold, should grow but not too fast:
k ≈ √n


Nearest neighbors: number of neighbors, k, should have
k/n going to zero, as n goes to infinity


Single decision trees: sample size, k_n, of the smallest cell
(terminal node) should have
k_n / log n going to infinity, as n goes to infinity
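The tuning rules above only constrain how parameters scale with the sample size. The sketch below picks one concrete schedule for each; the specific choices (square root of n for nodes and neighbors, squared log of n for the smallest tree cell) are assumptions made for illustration, not recommendations from the lecture, and the function name is hypothetical.

```python
import math


def suggested_tuning(n: int) -> dict:
    """Illustrative choices that satisfy the growth conditions; many others do too."""
    return {
        # Neural net: hidden nodes grow, but slowly, here like sqrt(n).
        "hidden_nodes": max(1, round(math.sqrt(n))),
        # Nearest neighbors: k = sqrt(n) grows with n while k/n goes to zero.
        "neighbors": max(1, round(math.sqrt(n))),
        # Decision tree: smallest cell of size (log n)^2, so k_n / log n grows.
        "min_cell_size": max(1, math.ceil(math.log(n) ** 2)),
    }


for n in (100, 10_000, 1_000_000):
    print(n, suggested_tuning(n))
```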


More on Myth #7 


Sharper results and improved tuning require more
technique, especially the
Vapnik-Chervonenkis (VC) dimension


VC upper and lower bounds provide optimal probability
statements about Bayes error and these are known exactly
for many machines.


The VC dimension is a measure of the flexibility of a 
machine. It needs to be high but not too high and scale 
with the sample size. 


Otherwise the machine will overfit: do well on sample, 
but not do well on test data. 


Still more on Myth #7


We ignore the VC dimension at our own peril. 


It should not be a theoretical swamp to be avoided 
by statisticians. 


It is nearly as important for practical reasons as 
the bootstrap. 


Myth #8 A machine must act as a global device
in order to be good 


FACT: A Bayes consistent machine must be local, 
and need be only weakly global. 


1. This is recent work by Zakai and Ritov, 2008 


2. Locality seems obvious for a nearest neighbor machine 
or decision tree, but it also holds for support vector 
machines or boosting (when either are consistent) 


3. The technical definition of local must be made precise,
but it basically means that the machine doesn't need to 
see data far from a test point. 


More about unlearning Myth #8 


Information is local and collective;
only rarely global or singular. 


Myth #9 There must exist some unique small set
of most predictive features 


FACT: Many simple examples show the nonuniqueness of 
important feature sets. 


1. Good models need not be unique. Or, it might take infinite 
amounts of data to detect any difference between them. 


2. Biological processes are typically not unique or singular. 
Nature does not function from singularity. 


3. Using relentlessly univariate methods over large feature sets 
is provably mistaken; See DGL for examples. 


More on Myth #9


When Nature has something to tell us She will say 
the same thing in at least three languages. 


(The Augmented Rosetta Stone Principle) 


Myth #10 Competing sets of important features
can be nicely ranked and combined 


FACT: Distinct lists of important features cannot always be 
combined and maintain logical consistency 

1. Already known to Condorcet, 1785! 

2. Relates to the voting paradox of Arrow, 1951 


3. Carefully studied by Saari and Haunsperger, 1991, 1992, 
2003 


4. It might be possible to use a kind of probabilistic ranking
instead; this is a research question
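The logical inconsistency can be seen in the classic three-way cycle. The sketch below uses three made-up importance rankings of hypothetical features A, B and C, and checks pairwise majority preference.

```python
# Three hypothetical importance rankings of features A, B, C (most important first).
rankings = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]


def prefers(ranking, a, b) -> bool:
    """True if this ranking places feature a ahead of feature b."""
    return ranking.index(a) < ranking.index(b)


def majority_prefers(a, b) -> bool:
    """True if more than half of the rankings place a ahead of b."""
    votes = sum(prefers(r, a, b) for r in rankings)
    return votes > len(rankings) / 2


pairwise = {(a, b): majority_prefers(a, b) for a in "ABC" for b in "ABC" if a != b}
cycle = pairwise[("A", "B")] and pairwise[("B", "C")] and pairwise[("C", "A")]

print("A beats B:", pairwise[("A", "B")])
print("B beats C:", pairwise[("B", "C")])
print("C beats A:", pairwise[("C", "A")])
```

Each pairwise majority is clear, yet together they form a cycle, so no single combined ranking respects all three.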


Summarizing what we have unlearned 


* Benchmarking over a set of machines on several data sets
  (to find a single grand super machine good for all data)
  is provably mistaken.


* It is informative to look at a set of machines on a single
  data set to see how they behave on that data: machine
  forensics


* Choosing a winner is often just unnecessary, since a
  committee will usually do better even with very weak
  machines, and...


* Random Forest and Random Jungle are good native
  committee machines, but groups can be formed over
  many very different machines.


More unlearning 


A good binary rule is not the same as a good group 
probability estimator— nor does it have to be. 


Good machines need some tuning but not much if there 
is any signal in the data. 


Multiple predictive feature sets can usually be identified 
but not uniquely so, and almost never with relentlessly 
univariate screening. 


Research Questions 


* Find solutions for merging multiple lists of features


* Find methods for network and clique detection among
  entangled, weakly predictive features


* Find good machines that handle missing data without
  imputation;
  See Mojirsheibani and Montazeri, JRSS-B, 2007


* Find good probability estimating machines. . .


Probability Estimation 


Probability estimates provide much more information 
than simple binary or category classification 


Central part of personalized medicine 


Probability estimating machine requirements: 


Assume only binary (or category) outcomes 
(the Y values) 


Must be completely nonparametric 
no assumptions about the structure 
or number of the features (the X values) 
no distribution or correlation assumptions 


More probability machine requirements 


Must be (at least) provably consistent: 
minimize expected MSE in the limit of large data 


Should have generally good convergence rates. 
Do such devices — probability machines — exist? 
Yes, in abundance, but not named as such... 


Any nonparametric regression machine that is 
provably consistent is a probability machine. 


Solving the regression problem?
The original Bayes problem for classification:
estimate the conditional probability function
η(x) = Pr[Y = 1 | X = x].


But this is not necessary for classification (!),
and may be quite hard (?).


On the other hand, for an arbitrary function f(Y), estimating
E[f(Y) | X = x]


is the basic regression problem: given data X, estimate the
expectation of Y. That is, if f(Y) = Y ∈ {0, 1}, then


E[f(Y) | X = x] = Pr[Y = 1 | X = x] = η(x)


Therefore: 


Any nonparametric learning machine consistent for the 
regression problem solves the probability machine problem. 


Any probability machine will also work for multi-category 
outcomes, Y=1,2,...,m. 
(but this is not exactly obvious) 


Basic nonparametric regression idea we like: 
take average of zeros and ones for points near the test point 
(or in the same terminal cell) 
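The local-averaging idea can be sketched with a one-dimensional k-nearest-neighbor estimate. Everything specific here is an assumption for illustration: the toy true probability (equal to x on the unit interval), the sample size, and the choice k = 200.

```python
import random


def knn_probability(train, x, k):
    """Estimate Pr[Y = 1 | X = x] as the mean of the 0/1 labels
    of the k training points nearest to x."""
    by_distance = sorted(train, key=lambda pair: abs(pair[0] - x))
    return sum(y for _, y in by_distance[:k]) / k


def eta(x: float) -> float:
    """Assumed true probability for this toy example: eta(x) = x on [0, 1]."""
    return x


rng = random.Random(1)
xs = [rng.random() for _ in range(5000)]
train = [(x, 1 if rng.random() < eta(x) else 0) for x in xs]

k = 200
estimates = {x0: knn_probability(train, x0, k) for x0 in (0.2, 0.5, 0.8)}

for x0, est in estimates.items():
    print(f"x = {x0:.1f}: estimate {est:.3f}, true {eta(x0):.3f}")
```

A terminal cell of a tree plays the same role as the neighborhood here: the estimate is the fraction of ones among the training points that fall in it.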


Possible Probability Machines 


Consider: 


Random Forests 
k-nearest neighbors 
bagged nearest neighbors 
neural nets 

support vector machines 
kernel methods 

boosting 


Some versions of these have been shown to be regression 
consistent; these can operate as probability machines. 


Other versions have unknown consistency properties, 
and make us visibly anxious. 


Two primitive but good machines 


Random forest using random splits 


Nearest neighbors 


Both are universally strongly consistent 


More evolved versions of Random Forest are also 
consistent, and very fast. 


Bagged nearest neighbors also consistent, 
but not very fast. 


Some technical details 


Random Forests, bagged nearest neighbors 
Biau, Cerou, Guyader, 2010 
Biau, Devroye, 2010 
Biau, 2010 
Biau, Devroye, Lugosi, 2008 


Neural nets, kernel methods, generalized data partitioning methods 
GKKW, 2002 and many other sources 


Support vector machines 
Consistency proofs for SVM regression appear to make many 
hard-to-evaluate technical assumptions. 
Christmann, Van Messem, 2008 


The multiclass problem for SVM regression is a binning and 
bracketing approach; far from a direct estimate; and at least one 
proof is possibly incorrect. 

Glasmachers, 2011 


More on consistency 


1. Unmodified logitboost is unlikely to be consistent 
Mease et al. 2007, 2008 


2. The prob option in Random Forest has unknown 
consistency properties 


3. Support vector machines may be regularized to gain 
consistency, but in general directly transforming the 
output function does not guarantee consistency. 

Bartlett, Tewari, 2007 


Validating any probability machine 


Converting the machine into a classifier and then 
checking the observed error rate is not efficient 


Treating the probability machine as a classifier and then 
checking the ROC curve for a probability machine is 
really not good 
The classification rates (sensitivity and specificity) 
do not define proper scores 


Weather forecasters and decision theorists have used other 
methods for years: 


The Brier score is the observed MSE given the
binary outcomes (Y = 0, 1)
and the estimated probability f(x).
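A minimal sketch of the Brier score as defined above. The outcomes and the two sets of forecasts are invented for illustration, and the function name is hypothetical.

```python
def brier_score(outcomes, probabilities) -> float:
    """Mean squared difference between 0/1 outcomes and predicted probabilities."""
    if len(outcomes) != len(probabilities):
        raise ValueError("outcomes and probabilities must have the same length")
    return sum((y - p) ** 2 for y, p in zip(outcomes, probabilities)) / len(outcomes)


# Hypothetical outcomes and forecasts, for illustration only.
y = [1, 0, 1, 1, 0]
sharp = [0.9, 0.1, 0.8, 0.7, 0.2]   # confident and mostly right
coin = [0.5] * 5                    # always says 50%

score_sharp = brier_score(y, sharp)
score_coin = brier_score(y, coin)

print(f"sharp forecasts: {score_sharp:.3f}, coin-toss forecasts: {score_coin:.3f}")
```

Lower is better; a forecaster who always says 0.5 scores exactly 0.25 whatever the outcomes.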


More on validating the machine 


The basic Brier score on training data is not obviously a 
consistent or efficient estimate for probability forecasting. 


However, we can show: 
Theorem (Malley, 2011). If a probability machine is consistent
then the Brier score is also consistent.


Bagging the Brier score on the training data should be good 
also. 
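One way to see the link between consistency and the Brier score is a standard decomposition of the expected squared error. This is a sketch rather than the proof of the theorem: f is a fixed probability estimate, (X, Y) is a new observation, and η(x) = Pr[Y = 1 | X = x].

```latex
\begin{align*}
\mathbb{E}\big[(Y - f(X))^2\big]
  &= \mathbb{E}\big[(Y - \eta(X))^2\big] + \mathbb{E}\big[(\eta(X) - f(X))^2\big] \\
  &= \mathbb{E}\big[\eta(X)\,(1 - \eta(X))\big] + \mathbb{E}\big[(\eta(X) - f(X))^2\big]
\end{align*}
```

The cross term vanishes because E[Y - η(X) | X] = 0. The first term does not depend on the machine, so the expected Brier score approaches its minimum exactly when the expected squared error of f as an estimate of η goes to zero.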


Some machines and some examples 


The machines 
Random forest in regression and 
classification mode 
nearest neighbors, bagged nearest neighbors 
logitboost 
logistic regression 


The examples 
Mease plateau model (2007) 
From UCI database: 
Sonar 
Pima Indian diabetes 
Appendicitis 


Mease synthetic data 


Mease et al. (2007, 2008): 
Two-dimensional circle simulation 

X is uniform on the square [0, 50] x [0, 50] 
Y=0,1 

Distance from (25, 25) is d(x):


P(Y = 1 | x) = 1                   if d(x) < 8
             = [28 - d(x)] / 20    if 8 ≤ d(x) ≤ 28
             = 0                   if d(x) > 28


Generate 5000 instances 
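A sketch of how this simulation might be generated. The thresholds 8 and 28 and the divisor 20 are as read from the slide (they make the probability continuous in the distance); the function names and the seed are hypothetical.

```python
import math
import random


def mease_probability(x1: float, x2: float) -> float:
    """P(Y = 1 | x) for the circle model, with d the distance from (25, 25)."""
    d = math.hypot(x1 - 25.0, x2 - 25.0)
    if d < 8.0:
        return 1.0
    if d <= 28.0:
        return (28.0 - d) / 20.0
    return 0.0


def generate_mease(n: int, seed: int = 0):
    """Draw n points uniformly on [0, 50] x [0, 50] with Bernoulli outcomes.
    Each record is (x1, x2, true probability, y)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(0, 50), rng.uniform(0, 50)
        p = mease_probability(x1, x2)
        y = 1 if rng.random() < p else 0
        data.append((x1, x2, p, y))
    return data


data = generate_mease(5000)
print(len(data), "instances; fraction with Y = 1:", sum(r[3] for r in data) / len(data))
```

Keeping the true probability alongside each outcome is what allows a plot of predicted versus true probability for each machine.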


Sonar data 


Sonar, Mines vs. Rocks 
UCI machine learning database 


208 records, 60 features 

Feature coefficients generated from N(2, 0.2) 
Linear function centered using an intercept term 
Probabilities generated using the expit function 


Binary outcomes generated from a binomial random 
number generator with the corresponding probabilities 


Results for Mease data


[Figure: prediction versus true probability (both axes 0.0 to 1.0), one panel per machine; recoverable panel titles: b-NN, classRF, k-NN, lboost]
More on Mease data


Results for Sonar data


[Figure: prediction versus true probability (both axes 0.0 to 1.0), one panel per machine: b-NN, classRF, k-NN, lboost, logreg, regRF]
More on Sonar data


Real data: Appendicitis 


The appendicitis data set is from Marchand et al., (1983). 


Before surgery eight laboratory tests made for diagnosis 
of acute appendicitis 


Following surgery 85 out of 106 patients were confirmed 
by biopsy to have had appendicitis: 


21 out of 106 did not need the surgery!


Real data: Pima Indian diabetes 


UCI machine learning database 
Pima-Indian women at least 21 years old 


768 patients: 268 with diabetes, 500 without 
Binary outcome Y= 0,1. 


Eight features: 

Number of times pregnant, plasma glucose concentration
at 2 hours in an oral glucose tolerance test, diastolic blood
pressure (mm Hg), triceps skin fold thickness (mm), 2-hour
serum insulin (mu U/ml), body mass index (kg/m²),
diabetes pedigree function, and age (years).


Appendicitis data


[Figure: Brier score (x-axis, about 0.05 to 0.15) for each machine: b-NN Bagged, k-NN, LogitBoost, LogReg, classRF, regRF]
Pima Indian Diabetes


[Figure: Brier score (x-axis, about 0.14 to 0.20) for each machine: b-NN Bagged, k-NN, LogitBoost, LogReg, classRF, regRF]
Probability Start Site Estimation
for Whole Genomes 


B Oliver (NIH, NIDDK) 

A Dasgupta (NIH, NIAMS) 
A Fletcher (NIH, CIT) 

K Malley (MRP, Inc.) 

J Malley (NIH,CIT) 


We begin with Drosophila melanogaster 
165 million bases; unknown number of start sites 


75% of known human disease genes have a match 
in the fruit fly genome 


Probability Machines for promoter start sites 


* Given a collection of N known strings of length m,
each containing a transcription start site, generate 
N strings of random bases, each of length m. 


* Train a probability machine on this data, report 
error rates, probability plots 
We began with 10,000+ known start site strings of
length 1,500, and start site at 1,000 


Features in the Forest 


* Used tetramers across the entire strings 

* Random Forests ranks the importance of these 
features: most important are very close to true 
start position 

* Error rates did not greatly improve using
more complex features (pairs or triples of 
separated tetramers) 
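A sketch of the two ingredients described above: random base strings for the non-start class, and tetramer features. How the tetramer features were actually built is not stated in the slides; whole-string counts of overlapping tetramers are a simplifying assumption here, and all names are hypothetical.

```python
import random
from collections import Counter
from itertools import product

BASES = "ACGT"
TETRAMERS = ["".join(t) for t in product(BASES, repeat=4)]  # all 256 tetramers


def random_string(m: int, rng: random.Random) -> str:
    """A string of m bases drawn uniformly at random (the non-start class)."""
    return "".join(rng.choice(BASES) for _ in range(m))


def tetramer_counts(seq: str) -> list:
    """Count each overlapping tetramer in seq, returned in TETRAMERS order."""
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    return [counts[t] for t in TETRAMERS]


rng = random.Random(0)
negative = random_string(1500, rng)     # length 1,500, as in the slides
features = tetramer_counts(negative)

print(len(features), "features; total tetramers counted:", sum(features))
```

Feature vectors built this way for the known start-site strings and for the random strings would form the training data for the probability machine.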


Early Results 


1. Using Random Forests, and where non-starts 
are purely random, the error rate for the true/ 
not-true string is about 20%: 


[Figure: plot against index (axis ticks 5,000 to 20,000), points marked by true status (0 or 1)]


[Figure: histogram of predicted probabilities (x-axis 0.0 to 1.0; counts up to about 3,000)]


Random Forests on the training data


[Figure: distribution of predicted probabilities by true status (0 or 1)]

Further experiments 
2. Using wild type non-starts from the genome the 
error climbs to about 37% (not pretty) 


Using only high probability true sites and wild type 
non-starts leads to 100% recovery of the start sites 


on Chromosome 2R: 


21 strong start sites on 2R 
among 21+ million bases 
Refinements and Extensions


* Continue using Random Forests as a probability 
machine and identify moderately strong start sites 


* Create a probability profile for the whole genome 


* Generate probability machines for other 
Drosophila species (30+) and also 


C. elegans: 90 million bases, 
Zebrafish (Danio rerio): 1.7 billion bases 
Humans: 3 billion bases 


Extensions of probability machines 


Multi-category outcomes (done, immediate) 
Matched case-control data (partly done, family trios) 


Survival profiles at multiple time points 
(under right censoring; no hazard model required) 


Prognostic outcome trajectories, interventions 
Probability manifolds: geodesics for clinical planning 
Probability network detection 


Probability ranking: Prob[feature Z is in top100 list] 
this dissolves the Condorcet preference paradox 


Conclusions 


Any nonparametric regression learning machine that is
known to be consistent is a valid probability machine. 


There are simple learning machines that are consistent: 
random forest, bagged nearest neighbor 


More complex machines may also be consistent (or not). 


For validating probabilities the Brier score is consistent when 
the machine is consistent. It provides transparent evaluation of 
any probability machine. 


Touring the Biowulf Cluster at NIH 